Level of Confidence Study for Roll-back Recovery with Checkpointing
Authors
Abstract
Increasing soft error rates in semiconductor devices manufactured in newer technologies necessitate the use of fault-tolerant techniques such as Roll-back Recovery with Checkpointing (RRC). However, RRC introduces time overhead that increases the completion (execution) time. For non-real-time systems, research has focused on optimizing RRC and has shown that it is possible to find the optimal number of checkpoints such that the average execution time is minimal. While minimal average execution time is important, for real-time systems it is important to provide a high probability of meeting given deadlines. Hence, there is a need for probabilistic guarantees that jobs employing RRC complete before a given deadline. In this paper, we therefore present a mathematical framework for evaluating the level of confidence, that is, the probability that a given deadline is met, when RRC is employed.
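To make the notion of level of confidence concrete, the following is a minimal sketch under a deliberately simplified model that is an assumption here and not the framework from the paper. It assumes a job of fault-free length `T` is split into `n` equal segments, each followed by a checkpoint costing `tau`; each segment execution succeeds independently with probability `q`; and a failed segment is re-executed from the last checkpoint. All names (`level_of_confidence`, `T`, `n`, `tau`, `q`, `D`) are hypothetical. Under these assumptions, the deadline `D` is met exactly when at least `n` of the segment executions that fit before `D` succeed.

```python
from math import comb, floor


def level_of_confidence(T, n, tau, q, D):
    """Probability that a job finishes by deadline D (simplified model).

    Assumptions (illustrative only, not taken from the paper):
      - the job is split into n equal segments of length T / n,
      - every segment execution is followed by a checkpoint of cost tau,
      - each segment execution succeeds independently with probability q,
      - a failed segment is re-executed from the last checkpoint.
    """
    seg = T / n + tau              # time for one segment execution
    m = floor(D / seg + 1e-9)      # segment executions that fit before D
    if m < n:
        return 0.0                 # not even a fault-free run fits
    # The job completes in time iff at least n of the m executions succeed.
    return sum(comb(m, k) * q**k * (1 - q) ** (m - k) for k in range(n, m + 1))


if __name__ == "__main__":
    # Sweep the number of checkpoints for a fixed deadline. The per-segment
    # success probability is held constant here purely for simplicity.
    for n in (2, 4, 5):
        print(n, level_of_confidence(T=100, n=n, tau=5, q=0.9, D=150))
```

In this toy model, a later deadline leaves room for more re-executions and so raises the level of confidence, while the checkpoint cost `tau` reduces the number of executions that fit before the deadline.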
Similar resources
On Low-Cost Error Containment and Recovery Methods for Guarded Software Upgrading
To assure dependable onboard evolution, we have developed a methodology called guarded software upgrading (GSU). In this paper, we focus on a low-cost approach to error containment and recovery for GSU. To ensure low development cost, we exploit inherent system resource redundancies as the fault tolerance means. In order to mitigate the effect of residual software faults at low performance cost...
A Novel Roll-Back Mechanism for Performance Enhancement of Asynchronous Checkpointing and Recovery
Improvements to a Roll-Back Mechanism for Asynchronous Checkpointing and Recovery
Gupta, Rahimi and Yang recently proposed a novel recovery algorithm for distributed systems in which checkpoints are taken asynchronously [1]. A checkpoint taken by a process is a snapshot of its local state, stored in a stable storage, so that the process can roll back to it, if this becomes necessary. The start of a process is also one of its checkpoints. Asynchronous checkpointing means that...
Performability Modeling of Coordinated Software and Hardware Fault Tolerance
Quantitative system evaluation concerning both software design faults and hardware operational faults has not yet received enough attention. Although a few studies considered such dependability analysis, analytic evaluation of systems with integrated software and hardware fault tolerance remains a challenge. In particular, the need to distinguish the effects of software design faults from those...
Comprehensive Low-overhead Process Recovery Based on Quasi-synchronous Checkpointing
In this paper, we propose a low-overhead recovery algorithm based on a quasi-synchronous checkpointing algorithm. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has the easen...